Here we go through the detailed steps involved in cleaning up the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup, step by step.
The cleanup tasks:

- Convert the Interest.Rate percentage strings to floats
- Remove the ' months' suffix from Loan.Length
- Remove bad data (implausible values and NA rows)
- Map each FICO.Range string to a single int
In [329]:
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData
Out[329]:
In [330]:
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate
Out[330]:
In [331]:
ir = loansData['Interest.Rate']
ir[0:5]
Out[331]:
In [332]:
type(ir[0:1])
irbak = ir
type(irbak)
loansDataBak = loansData
loansDataBak.reset_index()
loansDataBak
loansDataBak['Interest.Rate']
loansData
Out[332]:
In [333]:
ldb = loansDataBak.reset_index() # reset_index() returns a new DataFrame with a fresh 0..n-1 integer index; the original is left unchanged
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)
[irates[k] for k in (0,1,2,3,4)]
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
# 5
# 6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float:
#"""
Out[333]:
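The commented-out error above comes from iterating over a single string: `irates[k]` is one entry like '8.90%', so the comprehension loops over its characters, and `float(x[:-1])` sees `''` for a one-character `x`, hence the ValueError. A small sketch of the distinction, using illustrative values:

```python
irate = '8.90%'

# Iterating a single string yields its characters, not whole values:
chars = [c for c in irate]
print(chars)  # ['8', '.', '9', '0', '%']

# Iterating the full sequence of strings is what we actually want:
irates = ['8.90%', '12.12%']
rates = [float(x[:-1]) for x in irates]
print(rates)  # [8.9, 12.12]
```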
In [334]:
type(ldb)
Out[334]:
In [335]:
srates = ldb['Interest.Rate']
In [336]:
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'
In [337]:
nas = [ x for x in srates if not(x[0].isdigit()) ] # collect entries that don't start with a digit, e.g. NA strings
In [338]:
len(nas)
Out[338]:
In [339]:
srates[0][:-1]
Out[339]:
In [340]:
float(srates[0][:-1])
Out[340]:
In [341]:
nopct = [ x[:-1] for x in srates ]
In [342]:
flrates = [float(x) for x in nopct]
In [343]:
flrates[0:5]
Out[343]:
In [344]:
flrate = list(map(float, nopct)) # wrap map() in list(): in Python 3, map returns a lazy iterator that doesn't support slicing
In [345]:
flrate[0:5]
Out[345]:
In [346]:
len(flrate)
Out[346]:
In [347]:
ldb
Out[347]:
In [348]:
ldb['Interest.Rate'] = flrate
In [349]:
ldb[0:5]
Out[349]:
In [350]:
ldb['Interest.Rate'][0:5]
Out[350]:
In [351]:
srates = loansData['Interest.Rate']
In [352]:
nopct = [ x[:-1] for x in srates ]
In [353]:
flrates = [float(x) for x in nopct]
In [354]:
rates = [float(x[:-1]) for x in srates] # use this
In [355]:
flrates == rates
Out[355]:
In [356]:
loansData['Interest.Rate'] = flrates
In [357]:
loansData['Interest.Rate'][0:5]
Out[357]:
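The per-element list comprehension works, but pandas string methods can do the same conversion in one vectorized step. A minimal sketch on a hypothetical sample shaped like the Interest.Rate column:

```python
import pandas as pd

# Hypothetical sample mimicking the Interest.Rate column
s = pd.Series(['8.90%', '12.12%', '21.98%'])

# Strip the trailing '%' and convert to float, vectorized
rates = s.str.rstrip('%').astype(float)
print(rates.tolist())  # [8.9, 12.12, 21.98]
```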
In [358]:
withmons = ldb['Loan.Length']
In [359]:
wmons = withmons[0:]
In [360]:
wmons[0:5]
Out[360]:
In [361]:
wmons
Out[361]:
In [362]:
wmons[0].split()
Out[362]:
In [363]:
wmons[0].split()[0]
Out[363]:
In [364]:
int(wmons[0].split()[0])
Out[364]:
In [365]:
x = wmons[0].split()
In [366]:
x[0]
Out[366]:
In [367]:
int(x[0])
Out[367]:
In [368]:
intmons = [ int(x.split()[0]) for x in wmons ]
In [369]:
intmons[0:10]
Out[369]:
In [370]:
loansData['Loan.Length']
Out[370]:
In [371]:
loansData['Loan.Length'].value_counts()
Out[371]:
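As with the interest rates, the split-and-convert loop over Loan.Length can also be written with vectorized pandas string methods. A sketch on a hypothetical sample shaped like that column:

```python
import pandas as pd

# Hypothetical sample mimicking the Loan.Length column
s = pd.Series(['36 months', '36 months', '60 months'])

# Take the first whitespace-separated token and convert to int
months = s.str.split().str[0].astype(int)
print(months.tolist())  # [36, 36, 60]
```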
In [372]:
loansData['Monthly.Income'].describe()
Out[372]:
First we remove implausible values. We see the max value is 102750.
That is a MONTHLY income of over 100K dollars, which is certainly possible, but
highly implausible for a person seeking a loan of a few tens of thousands of dollars, i.e. implausible in this context.
In [373]:
loansData['Monthly.Income'].idxmax() # find the place where the max occurs
Out[373]:
But there's a better way: a row filter, i.e. a boolean expression used to restrict the rows of a DataFrame.
In our case we want to eliminate rows with income above 100K dollars, i.e. keep only those less than 100K.
In [374]:
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]
Out[374]:
In [375]:
loansData['Monthly.Income'].describe()
Out[375]:
In [376]:
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]
In [377]:
ldlt100
Out[377]:
In [378]:
len(ldlt100)
Out[378]:
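The row filter used above is worth seeing in isolation: the comparison produces a boolean Series, one entry per row, and indexing the DataFrame with it keeps only the True rows. A sketch with a toy frame (the values are made up):

```python
import pandas as pd

# Toy frame illustrating the row filter (made-up values)
df = pd.DataFrame({'Monthly.Income': [6500.0, 102750.0, 4891.67]})

mask = df['Monthly.Income'] < 100000  # a boolean Series, one entry per row
print(mask.tolist())                  # [True, False, True]
print(len(df[mask]))                  # 2: only the rows where the mask is True
```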
Now drop any rows that have 'NA' values, i.e. data not available.
In database terminology these would be 'NULL' values.
In [379]:
ldb2 = ldlt100.dropna()
In [380]:
ldb2
Out[380]:
In [381]:
len(ldb2)
Out[381]:
So we dropped one row that had an NA value somewhere.
Now that we have removed the "bad" data, let's take on the final data-cleaning task for this dataset: converting the FICO ranges to single integers.
Note that the FICO values are given as a range, in the form of a string that looks like lowerlimit-upperlimit, e.g. 720-724.
We want to convert each of these to a single int representing the lower limit, e.g. 720 in the above example.
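By default dropna() drops a row if any of its columns is missing. A sketch with toy data, also showing the `subset` parameter for checking only specific columns:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up numbers)
df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [10.0, 20.0, 30.0]})

print(len(df.dropna()))              # 2: drops any row containing a NaN
print(len(df.dropna(subset=['b'])))  # 3: only checks column 'b' for NaN
```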
In [382]:
ficostr = ldb2['FICO.Range']
In [383]:
ficostr[0:10]
Out[383]:
In [384]:
ficostr[0]
Out[384]:
In [385]:
ficoint = [ int(x.split('-')[0]) for x in ficostr ]
In [386]:
ficoint[0:10]
Out[386]:
In [387]:
ldb2['FICO.Range'] = ficoint
In [388]:
len(ficoint)
Out[388]:
In [389]:
ldb2['FICO.Range']
Out[389]:
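Here too the list comprehension has a vectorized equivalent using pandas string methods. A sketch on a hypothetical sample shaped like the FICO.Range column:

```python
import pandas as pd

# Hypothetical sample mimicking the FICO.Range column
s = pd.Series(['735-739', '715-719', '690-694'])

# Split on '-' and keep the lower limit as an int
fico = s.str.split('-').str[0].astype(int)
print(fico.tolist())  # [735, 715, 690]
```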
Now we have a dataset that we can use for our data exploration and analysis.
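The whole cleanup above can be condensed into a short pipeline. A sketch assuming the same column names as the Lending Club CSV, run here against a self-contained toy frame with made-up rows:

```python
import pandas as pd

# Toy frame with the same column layout as the Lending Club CSV (made-up rows)
df = pd.DataFrame({
    'Interest.Rate':  ['8.90%', '12.12%', '21.98%'],
    'Loan.Length':    ['36 months', '60 months', '36 months'],
    'Monthly.Income': [6541.67, 102750.0, None],
    'FICO.Range':     ['735-739', '715-719', '690-694'],
})

df['Interest.Rate'] = df['Interest.Rate'].str.rstrip('%').astype(float)
df['Loan.Length'] = df['Loan.Length'].str.split().str[0].astype(int)
df = df[df['Monthly.Income'] < 100000]  # drop implausible incomes (NaN compares False, so it goes too)
df = df.dropna()                        # drop any remaining rows with missing values
df['FICO.Range'] = df['FICO.Range'].str.split('-').str[0].astype(int)
print(len(df))  # 1
```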